llm: LoRA support #4

Open · wants to merge 10 commits into llm
Conversation

kyriediculous

This PR introduces support for Low-Rank Adaptation (LoRA) in our LLM inference pipeline, allowing for dynamic model adaptation at inference time. Key changes include:

  1. Enhanced LLMGeneratePipeline:

    • Added LoRA weight application functionality (see the pipeline sketch after this list)
    • Implemented a queue system for handling concurrent requests with different LoRA weights
    • Retained all existing model loading and optimization strategies (8-bit quantization, fp16/bf16 loading, distributed loading)
  2. Updated API route:

    • Added support for LoRA weights as an optional parameter
    • Implemented validation for the LoRA weight input, which must be a base64-encoded string (see the route sketch after this list)
  3. Maintained compatibility:

    • Preserved all existing pipeline functionality (memory management, device mapping, stable-fast optimization)
    • Ensured backward compatibility for requests without LoRA weights
  4. Updated dependencies:

    • Verified and updated requirements.txt to support new functionality
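
A minimal sketch of how these pipeline changes could fit together, assuming a transformers/PEFT-based stack. The class skeleton, the lock-based queue, and the `_unpack_adapter` helper are illustrative assumptions, not the PR's actual code:

```python
import asyncio
import base64
import tempfile
from typing import Optional

from peft import PeftModel


class LLMGeneratePipeline:
    """Illustrative skeleton; the real pipeline also handles 8-bit
    quantization, fp16/bf16 loading, and distributed loading as noted above."""

    def __init__(self, model, tokenizer):
        self.model = model          # base transformers model, already loaded
        self.tokenizer = tokenizer
        self._lora_lock = asyncio.Lock()  # serializes adapter swaps across requests

    async def generate(self, prompt: str, lora_weights: Optional[str] = None) -> str:
        if lora_weights is None:
            # Backward-compatible path: no LoRA, run the base model directly.
            return self._run(self.model, prompt)
        async with self._lora_lock:  # queue concurrent requests with different adapters
            adapted = self._apply_lora(lora_weights)
            try:
                return self._run(adapted, prompt)
            finally:
                adapted.unload()  # restore the base model for the next request

    def _apply_lora(self, lora_weights_b64: str) -> PeftModel:
        """Decode the base64 payload and attach the adapter with PEFT."""
        raw = base64.b64decode(lora_weights_b64)
        with tempfile.TemporaryDirectory() as tmp:
            _unpack_adapter(raw, tmp)  # hypothetical helper: write adapter files to tmp
            return PeftModel.from_pretrained(self.model, tmp)

    def _run(self, model, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)
```

Serializing adapter swaps behind a single lock is the simplest queueing strategy; a production implementation might instead batch requests per adapter or cache recently used adapters.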
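
On the API route side, the optional parameter and its base64 validation could look like the following FastAPI sketch; the endpoint path and the `lora_weights` field name are assumptions rather than the PR's actual schema:

```python
import base64
import binascii
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class LLMRequest(BaseModel):
    prompt: str
    lora_weights: Optional[str] = None  # optional base64-encoded adapter payload


@app.post("/llm")
async def llm_generate(req: LLMRequest):
    if req.lora_weights is not None:
        try:
            # validate=True rejects any string containing non-base64 characters.
            base64.b64decode(req.lora_weights, validate=True)
        except (binascii.Error, ValueError):
            raise HTTPException(
                status_code=400,
                detail="lora_weights must be a valid base64-encoded string",
            )
    # ... hand the request off to the pipeline here ...
    return {"status": "ok"}
```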

These changes enable users to apply custom LoRA weights to the base model at inference time, tailoring model behavior without retraining or reloading the underlying model. This feature enhances the flexibility of our inference API while maintaining its performance and efficiency.
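
For example, a client could base64-encode a local adapter file and send it along with the prompt (the endpoint and field names follow the hypothetical schema sketched above):

```python
import base64
import requests

with open("adapter.safetensors", "rb") as f:  # hypothetical local adapter file
    lora_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:8000/llm",
    json={"prompt": "Summarize LoRA in one sentence.", "lora_weights": lora_b64},
)
print(resp.json())
```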

Testing:

  • Verified functionality with and without LoRA weights
  • Tested concurrent requests with different LoRA weights
  • Ensured compatibility with existing pipelines and configurations

Open TODOs:

  • Comprehensive testing with various model sizes and LoRA configurations
  • Performance benchmarking to assess impact on inference speed
  • Documentation update to explain LoRA usage in API calls
